1 Cache Memories
Alan Jay Smith 

September 1982 ACM Computing Surveys (CSUR), volume 14 issue 3 

Full text available: pdf(4.61 MB) Additional Information: full citation, references, citings, index terms



2 Implementing a cache consistency protocol

R. H. Katz, S. J. Eggers, D. A. Wood, C. L. Perkins, R. G. Sheldon

June 1985 ACM SIGARCH Computer Architecture News, Proceedings of the 12th annual international symposium on Computer architecture, volume 13 issue 3
Full text available: pdf(803.11 KB) Additional Information: full citation, citings, index terms



Keywords: ownership-based protocols, shared bus multiprocessor cache consistency,
single chip implementation, snooping caches 



3 Area efficient architectures for information integrity in cache memories
Seongwoo Kim, Arun K. Somani 

May 1999 ACM SIGARCH Computer Architecture News, Proceedings of the 26th annual international symposium on Computer architecture, volume 27 issue 2
Full text available: pdf(227.09 KB), Publisher Site Additional Information: full citation, abstract, references, citings, index terms

Information integrity in cache memories is a fundamental requirement for dependable 
computing. Conventional architectures for enhancing cache reliability using check codes 
make it difficult to trade between the level of data integrity and the chip area requirement. 
We focus on transient fault tolerance in primary cache memories and develop new 
architectural solutions, to maximize fault coverage when the budgeted silicon area is not 
sufficient for the conventional configuration of an error checki ... 

4 Tango: a hardware-based data prefetching technique for superscalar processors
Shlomit S. Pinter, Adi Yoaz 

December 1996 Proceedings of the 29th annual ACM/IEEE international symposium on Microarchitecture
Full text available: pdf(1.46 MB) Additional Information: full citation, abstract, references, citings, index terms

We present a new hardware-based data prefetching mechanism for enhancing instruction 
level parallelism and improving the performance of superscalar processors. The emphasis in 
our scheme is on the effective utilization of slack time and hardware resources not used for 
the main computation. The scheme suggests a new hardware construct, the program 
progress graph (PPG), as a simple extension to the branch target buffer (BTB). We use the 
PPG for implementing a fast pre-program counter pre-PC, that ... 

Keywords: LRU mechanism, SPEC92 benchmark, Tango, base line architecture, branch 
target buffer, hardware resources, hardware-based data prefetching technique, instruction 
level parallelism, instruction prefetching, memory reference instructions, parallel 
processing, performance, program progress graph, simulation results, slack time, 
superscalar processors 



5 Piranha: a scalable architecture based on single-chip multiprocessing

Luiz Andre Barroso, Kourosh Gharachorloo, Robert McNamara, Andreas Nowatzyk, Shaz Qadeer, Barton Sano, Scott Smith, Robert Stets, Ben Verghese

May 2000 ACM SIGARCH Computer Architecture News, Proceedings of the 27th annual international symposium on Computer architecture, volume 28 issue 2

Additional Information: full citation, abstract, references, citings, index terms

The microprocessor industry is currently struggling with higher development costs and 
longer design times that arise from exceedingly complex processors that are pushing the 
limits of instruction-level parallelism. Meanwhile, such designs are especially ill suited for 
important commercial applications, such as on-line transaction processing (OLTP), which 
suffer from large memory stall times and exhibit little instruction-level parallelism. Given 
that commercial applications constitute by fa ... 

6 Recovery protocols for shared memory database systems
Lory D. Molesky, Krithi Ramamritham 

May 1995 ACM SIGMOD Record, Proceedings of the 1995 ACM SIGMOD international conference on Management of data, volume 24 issue 2
Full text available: pdf(1.65 MB) Additional Information: full citation, abstract, references, index terms

Significant performance advantages can be gained by implementing a database system on a 
cache-coherent shared memory multiprocessor. However, problems arise when failures 
occur. A single node (where a node refers to a processor/memory pair) crash may require a 
reboot of the entire shared memory system. Fortunately, shared memory multiprocessors 
that isolate individual node failures are currently being developed. Even with these, because 
of the side effects of the cache coherency protocol, ... 

7 4.2BSD and 4.3BSD as examples of the UNIX system
John S. Quarterman, Abraham Silberschatz, James L. Peterson 
December 1985 ACM Computing Surveys (CSUR), volume 17 issue 4 

Full text available: pdf(4.07 MB) Additional Information: full citation, abstract, references, citings, index terms, review

This paper presents an in-depth examination of the 4.2 Berkeley Software Distribution, 
Virtual VAX-11 Version (4.2BSD), which is a version of the UNIX Time-Sharing System. 
There are notes throughout on 4.3BSD, the forthcoming system from the University of 
California at Berkeley. We trace the historical development of the UNIX system from its 
conception in 1969 until today, and describe the design principles that have guided this 
development. We then present the internal data structures and ...

8 Verification techniques for cache coherence protocols
Fong Pong, Michel Dubois 

March 1997 ACM Computing Surveys (CSUR), volume 29 issue 1

Full text available: pdf(1.25 MB) Additional Information: full citation, abstract, references, citings, index terms

In this article we present a comprehensive survey of various approaches for the verification 
of cache coherence protocols based on state enumeration, (symbolic) model checking, and
symbolic state models. Since these techniques search the state space of the protocol 
exhaustively, the amount of memory required to manipulate that state information and the 
verification time grow very fast with the number of processors and the complexity of the 
protocol mechanism ... 

Keywords: cache coherence, finite state machine, protocol verification, shared-memory 
multiprocessors, state representation and expansion 



9 Owner prediction for accelerating cache-to-cache transfer misses in a cc-NUMA
architecture

Manuel E. Acacio, Jose Gonzalez, Jose M. Garcia, Jose Duato 

November 2002 Proceedings of the 2002 ACM/IEEE conference on Supercomputing 

Full text available: pdf(120.57 KB) Additional Information: full citation, abstract, references, citings, index terms

Cache misses for which data must be obtained from a remote cache (cache-to-cache 
transfer misses) account for an important fraction of the total miss rate. Unfortunately,
cc-NUMA designs put the access to the directory information into the critical path of 3-hop
misses, which significantly penalizes them compared to SMP designs. This work studies the 
use of owner prediction as a means of providing cc-NUMA multiprocessors with a more 
efficient support for cache-to-cache transfer misses. Our propo ... 

10 Don't use the page number, but a pointer to it 
Andre Seznec 

May 1996 ACM SIGARCH Computer Architecture News, Proceedings of the 23rd annual international symposium on Computer architecture, volume 24 issue 2

Full text available: pdf(1.26 MB) Additional Information: full citation, abstract, references, citings, index terms

Most newly announced high performance microprocessors support 64-bit virtual addresses 
and the width of physical addresses is also growing. As a result, the size of the address tags 
in the L1 cache is increasing. The impact on chip area is particularly dramatic when small
block sizes are used. At the same time, the performance of high performance 
microprocessors depends more and more on the accuracy of branch prediction and for 
reasons similar to those in the case of caches the size of the Br ... 

Keywords: address width, indirect-tagged caches, reduced branch target buffers, tag 
implementation cost 



11 Performance of database workloads on shared-memory systems with out-of-order 
processors 

Parthasarathy Ranganathan, Kourosh Gharachorloo, Sarita V. Adve, Luiz Andre Barroso 
October 1998 Proceedings of the eighth international conference on Architectural support for programming languages and operating systems, volume 33, 32 issue 11, 5

Full text available: pdf(1.62 MB) Additional Information: full citation, abstract, references, citings, index terms

Database applications such as online transaction processing (OLTP) and decision support
systems (DSS) constitute the largest and fastest-growing segment of the market for 
multiprocessor servers. However, most current system designs have been optimized to 
perform well on scientific and engineering workloads. Given the radically different behavior 
of database workloads (especially OLTP), it is important to re-evaluate key system design 
decisions in the context of this important class of applicatio ... 

12 Hardware-assisted replay of multiprocessor programs 
David F. Bacon, Seth Copen Goldstein 

December 1991 ACM SIGPLAN Notices, Proceedings of the 1991 ACM/ONR workshop on Parallel and distributed debugging, volume 26 issue 12
Full text available: pdf(1.20 MB) Additional Information: full citation, references, citings, index terms



13 Using destination-set prediction to improve the latency/bandwidth tradeoff in
shared-memory multiprocessors

Milo M. K. Martin, Pacia J. Harper, Daniel J. Sorin, Mark D. Hill, David A. Wood 

May 2003 ACM SIGARCH Computer Architecture News, Proceedings of the 30th annual international symposium on Computer architecture, volume 31 issue 2
Full text available: pdf(220.76 KB) Additional Information: full citation, abstract, references, citings

Destination-set prediction can improve the latency/bandwidth tradeoff in shared-memory 
multiprocessors. The destination set is the collection of processors that receive a particular 
coherence request. Snooping protocols send requests to the maximal destination set (i.e.,
all processors), reducing latency for cache-to-cache misses at the expense of increased 
traffic. Directory protocols send requests to the minimal destination set, reducing bandwidth 
at the expense of an indirection through the d ... 

14 Compiler and hardware support for cache coherence in large-scale multiprocessors:
design considerations and performance study

Lynn Choi, Pen-Chung Yew 

May 1996 ACM SIGARCH Computer Architecture News, Proceedings of the 23rd annual international symposium on Computer architecture, volume 24 issue 2
Full text available: pdf(1.48 MB) Additional Information: full citation, abstract, references, citings, index terms

In this paper, we study a hardware-supported, compiler directed (HSCD) cache coherence 
scheme, which can be implemented on a large-scale multiprocessor using off-the-shelf 
microprocessors, such as the Cray T3D. It can be adapted to various cache organizations, 
including multi-word cache lines and byte-addressable architectures. Several system related 
issues, including critical sections, inter-thread communication, and task migration have also 
been addressed. The cost of the required hardware sup ... 

15 Speculative synchronization: applying thread-level speculation to explicitly parallel
applications

Jose F. Martinez, Josep Torrellas 

October 2002 Proceedings of the 10th international conference on Architectural support for programming languages and operating systems, volume 36, 30, 37 issue 5, 5, 10

Full text available: pdf(1.49 MB) Additional Information: full citation, abstract, references, citings

Barriers, locks, and flags are synchronizing operations widely used by programmers and
parallelizing compilers to produce race-free parallel programs. Often times, these 
operations are placed suboptimally, either because of conservative assumptions about the 
program, or merely for code simplicity. We propose Speculative Synchronization, which 
applies the philosophy behind Thread-Level Speculation (TLS) to explicitly parallel
applications. Speculative threads execute past active barriers, busy ...

16 Informing memory operations: memory performance feedback mechanisms and their
applications

Mark Horowitz, Margaret Martonosi, Todd C. Mowry, Michael D. Smith 

May 1998 ACM Transactions on Computer Systems (TOCS), volume 16 issue 2 

Full text available: pdf(344.74 KB) Additional Information: full citation, abstract, references, citings, index terms, review

Memory latency is an important bottleneck in system performance that cannot be 
adequately solved by hardware alone. Several promising software techniques have been 
shown to address this problem successfully in specific situations. However, the generality of 
these software approaches has been limited because current architectures do not provide a
fine-grained, low-overhead mechanism for observing and reacting to memory behavior 
directly. To fill this need, this article proposes a new class ... 

Keywords: cache miss notification, memory latency, processor architecture 



17 A high-level abstraction of shared accesses 
Peter J. Keleher 

February 2000 ACM Transactions on Computer Systems (TOCS), volume 18 issue 1

Full text available: pdf(183.57 KB) Additional Information: full citation, abstract, references, index terms, review

We describe the design and use of the tape mechanism, a new high-level abstraction of 
accesses to shared data for software DSMs. Tapes consolidate and generalize a number of 
recent protocol optimizations, including update-based locks and recorded-replay barriers. 
Tapes are usually created by "recording" shared accesses. The resulting recordings can be 
used to anticipate future accesses by tailoring data movement to application semantics. 
Tapes-based mechanisms a ... 

Keywords: DSM, programming libraries, shared memory, update protocols 



18 Performance characterization of a Quad Pentium Pro SMP using OLTP workloads 
Kimberly Keeton, David A. Patterson, Yong Qiang He, Roger C. Raphael, Walter E. Baker 
April 1998 ACM SIGARCH Computer Architecture News, Proceedings of the 25th annual international symposium on Computer architecture, volume 26 issue 3
Full text available: pdf(1.50 MB), Publisher Site Additional Information: full citation, abstract, references, citings, index terms

Commercial applications are an important, yet often overlooked, workload with significantly 
different characteristics from technical workloads. The potential impact of these differences 
is that computers optimized for technical workloads may not provide good performance for 
commercial applications, and these applications may not fully exploit advances in processor 
design. To evaluate these issues, we use hardware counters to measure architectural 
features of a four-processor Pentium Pro-based se ... 

19 An effective on-chip preloading scheme to reduce data access penalty
Jean-Loup Baer, Tien-Fu Chen 

August 1991 Proceedings of the 1991 ACM/IEEE conference on Supercomputing 

Full text available: pdf(1.18 MB) Additional Information: full citation, references, citings, index terms

20 Survey of commercial parallel machines
Gowri Ramanathan, Joel Oren 

June 1993 ACM SIGARCH Computer Architecture News, volume 21 issue 3 

Full text available: pdf(1.64 MB) Additional Information: full citation, abstract, citings, index terms

We have presented in this paper the survey of the parallel machines that are marketed 
today. The survey includes the latest machines available from Kendall Square Research,
Thinking Machines Corporation, MasPar Computer Corporation, NCUBE Corporation, 
Sequent Computer Systems and Parsytec. We have provided the topology, architecture, 
cache coherence, synchronization and performance in MFLOPs for each of the machines 
subject to the availability of information.

21 Hardware-software trade-offs in a direct Rambus implementation of the RAMpage
memory hierarchy

Philip Machanick, Pierre Salverda, Lance Pompe 

October 1998 Proceedings of the eighth international conference on Architectural support for programming languages and operating systems, volume 32, 33 issue 5, 11

Full text available: pdf(1.47 MB) Additional Information: full citation, abstract, references, citings, index terms


The RAMpage memory hierarchy is an alternative to the traditional division between cache 
and main memory: main memory is moved up a level and DRAM is used as a paging 
device. The idea behind RAMpage is to reduce hardware complexity, if at the cost of 
software complexity, with a view to allowing more flexible memory system design. This 
paper investigates some issues in choosing between RAMpage and a conventional cache
architecture, with a view to illustrating trade-offs which can be made in choosi ... 



22 The V distributed system
David Cheriton

March 1988 Communications of the ACM, volume 31 issue 3

Full text available: pdf(2.55 MB) Additional Information: full citation, abstract, references, citings, index terms, review

The V distributed System was developed at Stanford University as part of a research project 
to explore issues in distributed systems. Aspects of the design suggest important directions 
for the design of future operating systems and communication systems. 

23 Principles of transaction-oriented database recovery
Theo Haerder, Andreas Reuter 

December 1983 ACM Computing Surveys (CSUR), volume 15 issue 4

Full text available: pdf(2.48 MB) Additional Information: full citation, references, citings, index terms, review



24 Functional verification of a multiple-issue, out-of-order superscalar Alpha processor—
the DEC Alpha 21264 microprocessor

Scott Taylor, Michael Quinn, Darren Brown, Nathan Dohm, Scot Hildebrandt, James Huggins, Carl Ramey

May 1998 Proceedings of the 35th annual conference on Design automation - Volume 00

Full text available: pdf(153.68 KB), Publisher Site Additional Information: full citation, abstract, references, citings, index terms

DIGITAL's Alpha 21264 processor is a highly out-of-order, superpipelined, superscalar
implementation of the Alpha architecture, capable of a peak execution rate of six 
instructions per cycle and a sustainable rate of four per cycle. The 21264 also features a 
500 MHz clock speed and a high-bandwidth system interface that channels up to 5.3 
Gbytes/second of cache data and 2.6 Gbytes/second of main-memory data into the 
processor. Simulation-based functional verification was performed on the lo ... 

Keywords: 21264, Alpha, architecture, coverage analysis, microprocessor, pseudo-random,
validation, verification 



25 I'm done simulating; now what? Verification coverage analysis and correctness
checking of the DEC chip 21164 Alpha microprocessor
Michael Kantrowitz, Lisa M. Noack

June 1996 Proceedings of the 33rd annual conference on Design automation 

Full text available: pdf(117.86 KB) Additional Information: full citation, references, citings, index terms



26 The HP AutoRAID hierarchical storage system 
J. Wilkes, R. Golding, C. Staelin, T. Sullivan 

December 1995 ACM SIGOPS Operating Systems Review, Proceedings of the fifteenth ACM symposium on Operating systems principles, volume 29 issue 5
Full text available: pdf(1.60 MB) Additional Information: full citation, references, citings, index terms



27 The HP AutoRAID hierarchical storage system
John Wilkes, Richard Golding, Carl Staelin, Tim Sullivan 

February 1996 ACM Transactions on Computer Systems (TOCS), volume 14 issue 1

Full text available: pdf(1.82 MB) Additional Information: full citation, abstract, references, citings, index terms

Configuring redundant disk arrays is a black art. To configure an array properly, a system 
administrator must understand the details of both the array and the workload it will 
support. Incorrect understanding of either, or changes in the workload over time, can lead 
to poor performance. We present a solution to this problem: a two-level storage hierarchy 
implemented inside a single disk-array controller. In the upper level of this hierarchy, two 
copies of active data are stored to provide f ... 

Keywords: RAID, disk array, storage hierarchy 



28 A forward-looking method of Cache memory control 
J. K. Iliffe 

September 1987 ACM SIGARCH Computer Architecture News, volume 15 issue 4

Shreekant S. Thakkar, Mark Sweiger

May 1990 ACM SIGARCH Computer Architecture News, Proceedings of the 17th annual international symposium on Computer Architecture, volume 18 issue 3
Full text available: pdf(1.05 MB) Additional Information: full citation, abstract, references, citings, index terms

Sequent's Symmetry Series is a bus-based shared-memory multiprocessor. System
performance in an OLTP relational database application was investigated using the TP1 
benchmark. System performance was tested with fully-cached benchmarks and with scaled 
benchmarks. In fully-cached tests, the entire database fits inside main memory. In scaled 
tests, the database is larger than available memory. In the fully-cached benchmark, 
performance was initially limited by bus saturation. The cause was the ... 

30 The verification of cache coherence protocols

August 1993 Proceedings of the fifth annual ACM symposium on Parallel algorithms and architectures